[SPARK-13763] [SQL] Remove Project when its Child's Output is Nil#11599
[SPARK-13763] [SQL] Remove Project when its Child's Output is Nil#11599gatorsmile wants to merge 3 commits intoapache:masterfrom
Conversation
| test("Eliminate the Project with an empty projectList") { | ||
| val input = OneRowRelation | ||
| val query = | ||
| Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze |
There was a problem hiding this comment.
Where do you test empty projectList?
There was a problem hiding this comment.
When running Optimize.execute(query), the second Project's projectList is pruned to empty at first. Then, the second Project will be removed.
There was a problem hiding this comment.
Let me add another case with an empty List too.
|
Test build #52721 has finished for PR 11599 at commit
|
|
Test build #52724 has finished for PR 11599 at commit
|
|
@cloud-fan Added another two cases. Feel free to let me know if you want me to add more cases. Thanks! |
| } | ||
|
|
||
| // Eliminate the Projects with empty projectList | ||
| case p @ Project(projectList, child) if projectList.isEmpty => child |
There was a problem hiding this comment.
I'm thinking of the correctness of this rule. Actually this is not column pruning, but add more columns, as child may have more one columns.
And why this rule case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child can't work?
There was a problem hiding this comment.
Because OneRowRelation has no output. So its output is different to its parent Project.
There was a problem hiding this comment.
But a Project with empty projectList also has no output right?
There was a problem hiding this comment.
case p @ Project(_, l: LeafNode) => p There is another case above it. Thus, it will stop here.
There was a problem hiding this comment.
How about this?
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p Then, we do not need the first line.
There was a problem hiding this comment.
Yea. As I posted before. I added a new rule that has side-effect to fix this issue too.
There was a problem hiding this comment.
Thanks @viirya @cloud-fan !
I am not sure which way is better.
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p My concern is the above line looks more hacky than the current PR fix.
There was a problem hiding this comment.
Let me respond the original question by @cloud-fan
We will not see an empty Project, if the child has more than one columns. The empty Project only happens after PruningColumns. I am fine, if we want to add an extra rule for eliminating Project only.
There was a problem hiding this comment.
how about we just move that case ahead? It seems always safe to apply case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child
There was a problem hiding this comment.
I thought we intentionally did it in this way. I am not 100% sure if we might hit any issue because of it. Let me try it and check if we will hit any test case failure.
|
Test build #52740 has finished for PR 11599 at commit
|
|
nit: we need to update the title and description. Technically we can't remove |
|
Done. The title and PR description are corrected. Thanks! |
|
Thanks, merging to master. |
#### What changes were proposed in this pull request? As shown in another PR: apache#11596, we are using `SELECT 1` as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example, ```SQL SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value ``` Before the PR, the optimized plan contains a useless `Project` after Optimizer executing the `ColumnPruning` rule, as shown below: ``` == Analyzed Logical Plan == value: int Project [value#22] +- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22] +- SubqueryAlias dummyTable +- Project [1 AS 1#21] +- OneRowRelation$ == Optimized Logical Plan == Generate explode([1,2,3]), false, false, Some(adtable), [value#22] +- Project +- OneRowRelation$ ``` After the fix, the optimized plan removed the useless `Project`, as shown below: ``` == Optimized Logical Plan == Generate explode([1,2,3]), false, false, Some(adtable), [value#22] +- OneRowRelation$ ``` This PR is to remove `Project` when its Child's output is Nil #### How was this patch tested? Added a new unit test case into the suite `ColumnPruningSuite.scala` Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11599 from gatorsmile/projectOneRowRelation.
What changes were proposed in this pull request?
As shown in another PR: #11596, we are using
SELECT 1as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example,Before the PR, the optimized plan contains a useless
Projectafter Optimizer executing theColumnPruningrule, as shown below:After the fix, the optimized plan removed the useless
Project, as shown below:This PR is to remove
Projectwhen its Child's output is NilHow was this patch tested?
Added a new unit test case into the suite
ColumnPruningSuite.scala